<!DOCTYPE html>
<html lang="" xml:lang="">
  <head>
    <title>Text Modeling</title>
    <meta charset="utf-8" />
    <meta name="author" content="Julia Silge | rstudio::conf | 28 Jan 2020" />
    <script src="libs/header-attrs/header-attrs.js"></script>
    <link href="libs/remark-css/default.css" rel="stylesheet" />
    <script src="https://use.fontawesome.com/5235085b15.js"></script>
    <link rel="stylesheet" href="css/xaringan-themer.css" type="text/css" />
    <link rel="stylesheet" href="css/footer_plus.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">




layout: true

&lt;div class="my-footer"&gt;&lt;span&gt;bit.ly/silge-rstudioconf-2&lt;/span&gt;&lt;/div&gt; 

---

class: inverse, center, bottom

background-image: url(figs/robert-bye-R-WtV-QyVnY-unsplash.jpg)
background-size: cover


# WELCOME!

### Text Mining Using Tidy Data Principles

---

class: inverse, center, middle

background-image: url(figs/p_and_p_cover.png)
background-size: cover


# Text Modeling

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

### USING TIDY PRINCIPLES

.large[Julia Silge | rstudio::conf | 28 Jan 2020]

---

class: middle, center

.pull-left[
# &lt;i class="fa fa-wifi"&gt;&lt;/i&gt;

Wifi network name  

.large[rstudio20]

]

.pull-left[
# &lt;i class="fa fa-key"&gt;&lt;/i&gt;

Wifi password

.large[tidyverse20]

]

---

&lt;img src="figs/blue_jane.png" style="position:absolute;top:30px;right:30px;" width="100px"/&gt;

## **Workshop policies**

--

- .large[Identify the exits closest to you in case of emergency] 

--

- .large[Please review the rstudio::conf code of conduct that applies to all workshops]

--

- .large[CoC issues can be addressed three ways:]

  - In person: contact any rstudio::conf staff member or the conference registration desk
  - By email: send a message to `conf@rstudio.com`
  - By phone: call 844-448-1212

--

- .large[Please do not photograph people wearing red lanyards]

--

- .large[A chill-out room is available for neurologically diverse attendees on the 4th floor of tower 1]

---

class: right, middle

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

# Find me at...

&lt;a href="http://twitter.com/juliasilge"&gt;&lt;i class="fa fa-twitter fa-fw"&gt;&lt;/i&gt;&amp;nbsp; @juliasilge&lt;/a&gt;&lt;br&gt;
&lt;a href="http://github.com/juliasilge"&gt;&lt;i class="fa fa-github fa-fw"&gt;&lt;/i&gt;&amp;nbsp; @juliasilge&lt;/a&gt;&lt;br&gt;
&lt;a href="https://juliasilge.com"&gt;&lt;i class="fa fa-link fa-fw"&gt;&lt;/i&gt;&amp;nbsp; juliasilge.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://tidytextmining.com"&gt;&lt;i class="fa fa-book fa-fw"&gt;&lt;/i&gt;&amp;nbsp; tidytextmining.com&lt;/a&gt;&lt;br&gt;
&lt;a href="mailto:julia.silge@gmail.com"&gt;&lt;i class="fa fa-paper-plane fa-fw"&gt;&lt;/i&gt;&amp;nbsp; julia.silge@gmail.com&lt;/a&gt;

---

class: left, top

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

# Meet your TAs

## 💫 Emil Hvitfelt (coordinator)

## 💥 Jeroen Claes

## ✨ Kasia Kulma

---

class: left, top

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

# Asking for help

--

## 🆘  "I'm stuck"

--

## ⚠️  "I'm not stuck, but I need help on my computer"

--

## 🙋  "I need help understanding something"

---

class: right, inverse, middle

background-image: url(figs/p_and_p_cover.png)
background-size: cover

# TIDYING AND CASTING 

&lt;h1 class="fa fa-check-circle fa-fw"&gt;&lt;/h1&gt;

---

background-image: url(figs/tmwr_0601.png)
background-size: 900px

---

class: inverse

background-image: url(figs/p_and_p_cover.png)
background-size: cover

# Two powerful NLP techniques

--

### 💡 Topic modeling

--

### 💡 Text classification

---

## Let's install some packages


```r
install.packages(c("tidyverse", 
                   "tidytext",
                   "gutenbergr",                   
                   "tidymodels",
                   "stm",
                   "glmnet"))
```


---

class: inverse

background-image: url(figs/p_and_p_cover.png)
background-size: cover

# Topic modeling

### 📖 Each DOCUMENT = mixture of topics

--

### 📑 Each TOPIC = mixture of tokens

---

class: top

background-image: url(figs/top_tags-1.png)
background-size: 800px

---

class: center, middle, inverse

background-image: url(figs/p_and_p_cover.png)
background-size: cover

# GREAT LIBRARY HEIST 🕵

---

## **Downloading your text data**


```r
library(tidyverse)
library(gutenbergr)

books &lt;- gutenberg_download(c(164, 36, 1342, 1400),
                            meta_fields = "title",
                            mirror = "https://www.gutenberg.org/")

books %&gt;%
  count(title)
```

```
## # A tibble: 4 x 2
##   title                                     n
##   &lt;chr&gt;                                 &lt;int&gt;
## 1 Great Expectations                    20024
## 2 Pride and Prejudice                   14203
## 3 The War of the Worlds                  6474
## 4 Twenty Thousand Leagues under the Sea 12135
```

---

## **Someone has torn your books apart!** 😭

.large[What do you predict will happen if we run the following code? 🤔]


```r
by_chapter &lt;- books %&gt;%
  group_by(title) %&gt;%
  mutate(chapter = cumsum(str_detect(text, 
                                     regex("^chapter ", 
                                           ignore_case = TRUE)))) %&gt;%
  ungroup() %&gt;%
  filter(chapter &gt; 0) %&gt;%
  unite(document, title, chapter)

glimpse(by_chapter)
```

---

## **Someone has torn your books apart!** 😭

.large[What do you predict will happen if we run the following code? 🤔]


```r
by_chapter &lt;- books %&gt;%
  group_by(title) %&gt;%
  mutate(chapter = cumsum(str_detect(text, 
                                     regex("^chapter ", 
                                           ignore_case = TRUE)))) %&gt;%
  ungroup() %&gt;%
  filter(chapter &gt; 0) %&gt;%
  unite(document, title, chapter)

glimpse(by_chapter)
```

```
## Rows: 38,578
## Columns: 3
## $ gutenberg_id &lt;int&gt; 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36...
## $ text         &lt;chr&gt; "CHAPTER ONE", "", "THE EVE OF THE WAR", "", "", "No o...
## $ document     &lt;chr&gt; "The War of the Worlds_1", "The War of the Worlds_1", ...
```

---

## **Can we put them back together?**


```r
library(tidytext)

word_counts &lt;- by_chapter %&gt;%
* unnest_tokens(word, text) %&gt;%
  anti_join(get_stopwords()) %&gt;%
  count(document, word, sort = TRUE)

glimpse(word_counts)
```

```
## Rows: 96,822
## Columns: 3
## $ document &lt;chr&gt; "Great Expectations_57", "Great Expectations_7", "Great Ex...
## $ word     &lt;chr&gt; "joe", "joe", "biddy", "joe", "estella", "joe", "said", "p...
## $ n        &lt;int&gt; 88, 70, 63, 58, 58, 56, 53, 53, 50, 50, 50, 50, 47, 46, 45...
```

---

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

## Jane wants to know...

.large[The dataset `word_counts` contains]

- .large[the counts of words per book]
- .large[the counts of words per chapter]
- .large[the counts of words per line]

---

## **Can we put them back together?**


```r
words_sparse &lt;- word_counts %&gt;%
* cast_sparse(document, word, n)

class(words_sparse)
```

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

```r
dim(words_sparse)
```

```
## [1]   132 17048
```

---

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

## Jane wants to know...

.large[Is `words_sparse` a tidy dataset?]

- .large[Yes ✅]
- .large[No 🚫]

---

## **Train a topic model**

Use a sparse matrix or a `quanteda::dfm` object as input


```r
library(stm)

topic_model &lt;- stm(words_sparse, K = 4, 
                   verbose = FALSE, 
                   init.type = "Spectral")
```
---

## **Train a topic model**

Use a sparse matrix or a `quanteda::dfm` object as input


```r
summary(topic_model)
```

```
## A topic model with 4 topics, 132 documents and a 17048 word dictionary.
```

```
## Topic 1 Top Words:
##  	 Highest Prob: captain, nautilus, one, sea, nemo, said, ned 
##  	 FREX: nautilus, nemo, ned, conseil, canadian, submarine, aronnax 
##  	 Lift: vanikoro, la, dillon, recherche, paradise, canoes, galleons 
##  	 Score: nautilus, nemo, ned, conseil, captain, canadian, ocean 
## Topic 2 Top Words:
##  	 Highest Prob: joe, said, now, biddy, one, upon, miss 
##  	 FREX: sergeant, joe, biddy, sister, pumblechook, joseph, orlick 
##  	 Lift: sergeant, galley, rul, clem, hulks, arthur, william 
##  	 Score: joe, biddy, pip, havisham, estella, pumblechook, mr 
## Topic 3 Top Words:
##  	 Highest Prob: said, mr, one, wemmick, herbert, know, now 
##  	 FREX: guardian, jaggers's, skiffins, walworth, rum, wemmick, jaggers 
##  	 Lift: jaggers's, jane, coiler, subordinate, look'ee, belinda, whimple 
##  	 Score: mr, wemmick, herbert, jaggers, pip, havisham, estella 
## Topic 4 Top Words:
##  	 Highest Prob: one, upon, said, martians, people, came, towards 
##  	 FREX: martians, martian, woking, mars, curate, pine, ulla 
##  	 Lift: martians, mars, curate, shepperton, henderson, hood, seaward 
##  	 Score: martians, martian, woking, cylinder, mars, curate, ulla
```


---

## **Exploring the output of topic modeling**



```r
chapter_topics &lt;- tidy(topic_model, matrix = "beta")

chapter_topics
```

```
## # A tibble: 68,192 x 3
##    topic term        beta
##    &lt;int&gt; &lt;chr&gt;      &lt;dbl&gt;
##  1     1 joe     2.09e-56
##  2     2 joe     1.70e- 2
##  3     3 joe     2.56e- 4
##  4     4 joe     1.55e-50
##  5     1 biddy   1.00e-58
##  6     2 biddy   5.63e- 3
##  7     3 biddy   6.42e- 5
##  8     4 biddy   9.70e-53
##  9     1 estella 1.20e-56
## 10     2 estella 3.44e- 3
## # ... with 68,182 more rows
```

---

## **Exploring the output of topic modeling**

.unscramble[U N S C R A M B L E]

```
top_terms &lt;- chapter_topics %&gt;%
```
```
ungroup() %&gt;%
```
```
group_by(topic) %&gt;%
```
```
arrange(topic, -beta)
```
```
top_n(10, beta) %&gt;%
```


---

## **Exploring the output of topic modeling**


```r
top_terms &lt;- chapter_topics %&gt;%
  group_by(topic) %&gt;%
  top_n(10, beta) %&gt;%
  ungroup() %&gt;%
  arrange(topic, -beta)
```

---

## **Exploring the output of topic modeling**


```r
top_terms
```

```
## # A tibble: 40 x 3
##    topic term        beta
##    &lt;int&gt; &lt;chr&gt;      &lt;dbl&gt;
##  1     1 captain  0.0119 
##  2     1 nautilus 0.0101 
##  3     1 one      0.00684
##  4     1 sea      0.00682
##  5     1 nemo     0.00674
##  6     1 said     0.00635
##  7     1 ned      0.00622
##  8     1 us       0.00578
##  9     1 conseil  0.00526
## 10     1 land     0.00468
## # ... with 30 more rows
```

---
## **Exploring the output of topic modeling**


```r
top_terms %&gt;%
  mutate(term = fct_reorder(term, beta)) %&gt;%
* ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
```

---

![](modeling_files/figure-html/unnamed-chunk-13-1.png)&lt;!-- --&gt;

---

## **How are documents classified?**


```r
chapters_gamma &lt;- tidy(topic_model, matrix = "gamma",
                       document_names = rownames(words_sparse))

chapters_gamma
```

```
## # A tibble: 528 x 3
##    document              topic    gamma
##    &lt;chr&gt;                 &lt;int&gt;    &lt;dbl&gt;
##  1 Great Expectations_57     1 0.000115
##  2 Great Expectations_7      1 0.000102
##  3 Great Expectations_17     1 0.000149
##  4 Great Expectations_27     1 0.000171
##  5 Great Expectations_38     1 0.000133
##  6 Great Expectations_2      1 0.000133
##  7 Great Expectations_19     1 0.000155
##  8 Great Expectations_23     1 0.000171
##  9 Great Expectations_11     1 0.000160
## 10 Great Expectations_15     1 0.000137
## # ... with 518 more rows
```

---

## **How are documents classified?**

.large[What do you predict will happen if we run the following code? 🤔]


```r
chapters_parsed &lt;- chapters_gamma %&gt;%
  separate(document, c("title", "chapter"), 
           sep = "_", convert = TRUE)

chapters_parsed
```

---

## **How are documents classified?**

.large[What do you predict will happen if we run the following code? 🤔]


```r
chapters_parsed &lt;- chapters_gamma %&gt;%
  separate(document, c("title", "chapter"), 
           sep = "_", convert = TRUE)

glimpse(chapters_parsed)
```

```
## Rows: 528
## Columns: 4
## $ title   &lt;chr&gt; "Great Expectations", "Great Expectations", "Great Expectat...
## $ chapter &lt;int&gt; 57, 7, 17, 27, 38, 2, 19, 23, 11, 15, 18, 16, 22, 51, 9, 54...
## $ topic   &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ gamma   &lt;dbl&gt; 0.0001145730, 0.0001018362, 0.0001487511, 0.0001713696, 0.0...
```

---

## **How are documents classified?**

.unscramble[U N S C R A M B L E]

```
chapters_parsed %&gt;%
```
```
ggplot(aes(factor(topic), gamma)) +
```
```
facet_wrap(~ title)
```
```
mutate(title = fct_reorder(title, gamma * topic)) %&gt;%
```
```
geom_boxplot() +
```

---

## **How are documents classified?**


```r
chapters_parsed %&gt;%
  mutate(title = fct_reorder(title, gamma * topic)) %&gt;%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ title)
```

---

![](modeling_files/figure-html/unnamed-chunk-18-1.png)&lt;!-- --&gt;

---

class: center, middle, inverse

background-image: url(figs/p_and_p_cover.png)
background-size: cover

# GOING FARTHER 🚀

---

## Tidying model output

### Which words in each document are assigned to which topics?

- .large[`augment()`]
- .large[Add information to each observation in the original data]

---

background-image: url(figs/stm_video.png)
background-size: 850px

---

## **Using stm**

- .large[Document-level covariates]


```r
topic_model &lt;- stm(words_sparse, 
                   K = 0, init.type = "Spectral",
                   prevalence = ~s(Year),
                   data = covariates,
                   verbose = FALSE)
```

- .large[Use functions for `semanticCoherence()`, `checkResiduals()`, `exclusivity()`, and more!]

- .large[Check out http://www.structuraltopicmodel.com/]

- .large[See [my blog post](https://juliasilge.com/blog/evaluating-stm/) for how to choose `K`, the number of topics]

---


background-image: url(figs/model_diagnostic-1.png)
background-position: 50% 50%
background-size: 950px

---

# Stemming?

.large[Advice from [Schofield &amp; Mimno](https://mimno.infosci.cornell.edu/papers/schofield_tacl_2016.pdf)]

.large["Comparing Apples to Apple: The Effects of Stemmers on Topic Models"]

---

class: right, middle

&lt;h1 class="fa fa-quote-left fa-fw"&gt;&lt;/h1&gt;

&lt;h2&gt; Despite their frequent use in topic modeling, we find that stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability. &lt;/h2&gt;

&lt;h1 class="fa fa-quote-right fa-fw"&gt;&lt;/h1&gt;
---

## **Train many topic models**


```r
library(furrr)
# plan(multicore)
# plan(multisession)

*many_models &lt;- tibble(K = c(3, 4, 6, 8, 10)) %&gt;%
  mutate(topic_model = future_map(K, 
                                  ~stm(words_sparse, K = .,
                                       verbose = FALSE)))

many_models
```

```
## # A tibble: 5 x 2
##       K topic_model
##   &lt;dbl&gt; &lt;list&gt;     
## 1     3 &lt;STM&gt;      
## 2     4 &lt;STM&gt;      
## 3     6 &lt;STM&gt;      
## 4     8 &lt;STM&gt;      
## 5    10 &lt;STM&gt;
```

---

## **Train many topic models**


```r
heldout &lt;- make.heldout(words_sparse)

k_result &lt;- many_models %&gt;%
  mutate(exclusivity        = map(topic_model, exclusivity),
         semantic_coherence = map(topic_model, semanticCoherence, words_sparse),
         eval_heldout       = map(topic_model, eval.heldout, heldout$missing),
         residual           = map(topic_model, checkResiduals, words_sparse),
         bound              = map_dbl(topic_model, function(x) max(x$convergence$bound)),
         lfact              = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
         lbound             = bound + lfact,
         iterations         = map_dbl(topic_model, function(x) length(x$convergence$bound)))
```

---

## **Train many topic models**


```r
k_result
```

```
## # A tibble: 5 x 10
##       K topic_model exclusivity semantic_cohere~ eval_heldout residual   bound
##   &lt;dbl&gt; &lt;list&gt;      &lt;list&gt;      &lt;list&gt;           &lt;list&gt;       &lt;list&gt;     &lt;dbl&gt;
## 1     3 &lt;STM&gt;       &lt;dbl [3]&gt;   &lt;dbl [3]&gt;        &lt;named list~ &lt;named ~ -1.29e6
## 2     4 &lt;STM&gt;       &lt;dbl [4]&gt;   &lt;dbl [4]&gt;        &lt;named list~ &lt;named ~ -1.28e6
## 3     6 &lt;STM&gt;       &lt;dbl [6]&gt;   &lt;dbl [6]&gt;        &lt;named list~ &lt;named ~ -1.26e6
## 4     8 &lt;STM&gt;       &lt;dbl [8]&gt;   &lt;dbl [8]&gt;        &lt;named list~ &lt;named ~ -1.25e6
## 5    10 &lt;STM&gt;       &lt;dbl [10]&gt;  &lt;dbl [10]&gt;       &lt;named list~ &lt;named ~ -1.24e6
## # ... with 3 more variables: lfact &lt;dbl&gt;, lbound &lt;dbl&gt;, iterations &lt;dbl&gt;
```

---

## **Train many topic models**


```r
k_result %&gt;%
  transmute(K,
            `Lower bound`         = lbound,
*           Residuals             = map_dbl(residual, "dispersion"),
*           `Semantic coherence`  = map_dbl(semantic_coherence, mean),
*           `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %&gt;%
  gather(Metric, Value, -K) %&gt;%
  ggplot(aes(K, Value, color = Metric)) +
  geom_line() +
  facet_wrap(~Metric, scales = "free_y")
```

---

![](modeling_files/figure-html/unnamed-chunk-24-1.png)&lt;!-- --&gt;

---

## **What is semantic coherence?**

- .large[Semantic coherence is maximized when the most probable words in a given topic frequently co-occur together]

--

- .large[Correlates well with human judgment of topic quality 😃]

--

- .large[Having high semantic coherence is relatively easy, though, if you only have a few topics dominated by very common words 😢]

--

- .large[Measure semantic coherence **and** exclusivity]

---

## **Train many topic models**


```r
k_result %&gt;%
  select(K, exclusivity, semantic_coherence) %&gt;%
  filter(K %in% c(3, 6, 10)) %&gt;%
  unnest(cols = c(exclusivity, semantic_coherence)) %&gt;%
  ggplot(aes(semantic_coherence, exclusivity, 
             color = factor(K))) +
  geom_point()
```

---

![](modeling_files/figure-html/unnamed-chunk-26-1.png)&lt;!-- --&gt;

---

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

## Jane wants to know...

.large[Topic modeling is an example of...]

- .unscramble[supervised machine learning]
- .unscramble[unsupervised machine learning]


---

class: right, middle, inverse

background-image: url(figs/p_and_p_cover.png)
background-size: cover


# TEXT CLASSIFICATION
&lt;h1 class="fa fa-balance-scale fa-fw"&gt;&lt;/h1&gt;

---

## **Downloading your text data**


```r
library(tidyverse)
library(gutenbergr)

titles &lt;- c("The War of the Worlds",
            "Pride and Prejudice")

books &lt;- gutenberg_works(title %in% titles) %&gt;%
  gutenberg_download(meta_fields = "title", 
                     mirror = "https://www.gutenberg.org/") %&gt;%
  mutate(document = row_number())

glimpse(books)
```

```
## Rows: 20,677
## Columns: 4
## $ gutenberg_id &lt;int&gt; 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36...
## $ text         &lt;chr&gt; "The War of the Worlds", "", "by H. G. Wells [1898]", ...
## $ title        &lt;chr&gt; "The War of the Worlds", "The War of the Worlds", "The...
## $ document     &lt;int&gt; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
```

---

## **Making a tidy dataset**

.large[Use this kind of data structure for EDA! 💅]


```r
library(tidytext)

tidy_books &lt;- books %&gt;%
* unnest_tokens(word, text) %&gt;%
  group_by(word) %&gt;%
  filter(n() &gt; 10) %&gt;%
  ungroup

glimpse(tidy_books)
```

```
## Rows: 159,742
## Columns: 4
## $ gutenberg_id &lt;int&gt; 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36...
## $ title        &lt;chr&gt; "The War of the Worlds", "The War of the Worlds", "The...
## $ document     &lt;int&gt; 1, 1, 1, 1, 3, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, ...
## $ word         &lt;chr&gt; "the", "war", "of", "the", "by", "but", "who", "shall"...
```

---

## **Create training and testing sets**

.large[What do you predict will happen if we run the following code? 🤔]


```r
library(rsample)

books_split &lt;- tidy_books %&gt;%
  distinct(document) %&gt;%
* initial_split()

train_data &lt;- training(books_split)
test_data &lt;- testing(books_split)
```


---

## **Cast to a sparse matrix**


```r
sparse_words &lt;- tidy_books %&gt;%
  count(document, word, sort = TRUE) %&gt;%
  inner_join(train_data) %&gt;%
* cast_sparse(document, word, n)

class(sparse_words)
```

```
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
```

```r
dim(sparse_words)
```

```
## [1] 12881  1654
```

---

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

## Jane wants to know...

.large[Which `dim` of the sparse matrix is the number of features?]

- .large[12881]
- .large[1654]

.large[Feature = term = word]

---

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

## Jane wants to know...

.large[If you want to use tf-idf instead of counts, should you calculate tf-idf before or after splitting train and test?]

- .large[Before]
- .large[After]

---

## **Build a dataframe with the response variable**



```r
word_rownames &lt;- as.integer(rownames(sparse_words))

books_joined &lt;- tibble(document = word_rownames) %&gt;%
  left_join(books %&gt;%
              select(document, title))

glimpse(books_joined)
```

```
## Rows: 12,881
## Columns: 2
## $ document &lt;int&gt; 6450, 1264, 1287, 1639, 1698, 1897, 2135, 2161, 2858, 2930...
## $ title    &lt;chr&gt; "The War of the Worlds", "The War of the Worlds", "The War...
```


---

## **Train a glmnet model**


```r
library(glmnet)
# library(doMC)
# registerDoMC(cores = 8)

is_jane &lt;- books_joined$title == "Pride and Prejudice"

model &lt;- cv.glmnet(sparse_words, is_jane, 
                   family = "binomial", 
                   # parallel = TRUE, 
                   keep = TRUE)
```

- .large[Regularization constrains magnitude of coefficients]

- .large[LASSO performs feature selection]


---

## **Tidying our model**

.large[Tidy, then filter to choose some lambda from glmnet output]


```r
library(broom)

coefs &lt;- model$glmnet.fit %&gt;%
  tidy() %&gt;%
  filter(lambda == model$lambda.1se)

Intercept &lt;- coefs %&gt;%
  filter(term == "(Intercept)") %&gt;%
  pull(estimate)
```

---

## **Tidying our model**

.unscramble[U N S C R A M B L E]

```
classifications &lt;- tidy_books %&gt;%
```
```
mutate(probability = plogis(Intercept + score))
```
```
inner_join(test_data) %&gt;%
```
```
group_by(document) %&gt;%
```
```
inner_join(coefs, by = c("word" = "term")) %&gt;%
```
```
summarize(score = sum(estimate)) %&gt;%
```

---

## **Tidying our model**


```r
classifications &lt;- tidy_books %&gt;%
  inner_join(test_data) %&gt;%
  inner_join(coefs, by = c("word" = "term")) %&gt;%
  group_by(document) %&gt;%
  summarize(score = sum(estimate)) %&gt;%
  mutate(probability = plogis(Intercept + score))

glimpse(classifications)
```

```
## Rows: 4,264
## Columns: 3
## $ document    &lt;int&gt; 1, 13, 15, 19, 25, 32, 36, 37, 48, 50, 51, 55, 56, 63, ...
## $ score       &lt;dbl&gt; -2.7327390, -0.3613760, -5.5202942, 0.3911530, -1.46135...
## $ probability &lt;dbl&gt; 0.1225755040, 0.5994320575, 0.0085281431, 0.7605374508,...
```

---

## **Understanding our model**

.unscramble[U N S C R A M B L E]

```
coefs %&gt;%
```
```
group_by(estimate &gt; 0) %&gt;%
```
```
coord_flip()
```
```
geom_col(show.legend = FALSE) +
```
```
ungroup %&gt;%
```
```
top_n(10, abs(estimate)) %&gt;%
```
```
ggplot(aes(fct_reorder(term, estimate), 
estimate, 
fill = estimate &gt; 0)) +
```

---

## **Understanding our model**


```r
coefs %&gt;%
  group_by(estimate &gt; 0) %&gt;%
  top_n(10, abs(estimate)) %&gt;%
  ungroup %&gt;%
  ggplot(aes(fct_reorder(term, estimate), 
             estimate, 
             fill = estimate &gt; 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip()
```


---

![](modeling_files/figure-html/unnamed-chunk-36-1.png)&lt;!-- --&gt;

---

## **ROC**

.large[What do you predict will happen if we run the following code? 🤔]


```r
comment_classes &lt;- classifications %&gt;%
  left_join(books %&gt;%
              select(title, document), by = "document") %&gt;%
  mutate(title = as.factor(title))

comment_classes
```

---

## **ROC**

.large[What do you predict will happen if we run the following code? 🤔]


```r
comment_classes &lt;- classifications %&gt;%
  left_join(books %&gt;%
              select(title, document), by = "document") %&gt;%
  mutate(title = as.factor(title))

glimpse(comment_classes)
```

```
## Rows: 4,264
## Columns: 4
## $ document    &lt;int&gt; 1, 13, 15, 19, 25, 32, 36, 37, 48, 50, 51, 55, 56, 63, ...
## $ score       &lt;dbl&gt; -2.7327390, -0.3613760, -5.5202942, 0.3911530, -1.46135...
## $ probability &lt;dbl&gt; 0.1225755040, 0.5994320575, 0.0085281431, 0.7605374508,...
## $ title       &lt;fct&gt; The War of the Worlds, The War of the Worlds, The War o...
```

---

## **ROC**


```r
library(yardstick)

comment_classes %&gt;%
* roc_curve(title, probability) %&gt;%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(size = 1.5) +
  geom_abline(
    lty = 2, alpha = 0.5,
    color = "gray50",
    size = 1.2
  )
```

---

![](modeling_files/figure-html/unnamed-chunk-40-1.png)&lt;!-- --&gt;

---

## **AUC for model**


```r
comment_classes %&gt;%
  roc_auc(title, probability)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
## 1 roc_auc binary         0.970
```

---

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

## Jane wants to know...

.large[Is this the AUC for the training or testing data?]

- .large[Training]
- .large[Testing]

---

## **Confusion matrix**


```r
comment_classes %&gt;%
  mutate(
    prediction = case_when(
      probability &gt; 0.5 ~ "Pride and Prejudice",
      TRUE ~ "The War of the Worlds"
    ),
    prediction = as.factor(prediction)
  ) %&gt;%
* conf_mat(title, prediction)
```

```
##                        Truth
## Prediction              Pride and Prejudice The War of the Worlds
##   Pride and Prejudice                  2799                   211
##   The War of the Worlds                 142                  1112
```

---

## **Misclassifications**

Let's talk about misclassifications. Which documents here were incorrectly predicted to be written by Jane Austen?


```r
comment_classes %&gt;%
* filter(probability &gt; .8, title == "The War of the Worlds") %&gt;%
  sample_n(5) %&gt;%
  inner_join(books %&gt;%
               select(document, text)) %&gt;%
  select(probability, text)
```

```
## # A tibble: 5 x 2
##   probability text                                                              
##         &lt;dbl&gt; &lt;chr&gt;                                                             
## 1       0.882 wheels flung behind her, she receded with terrifying slowness from
## 2       0.966 to Brigadier-General Marvin, and tell him all you know.  He's at  
## 3       0.823 excavating and embanking in a methodical and discriminating manne~
## 4       0.820 CHAPTER FOURTEEN                                                  
## 5       0.932 It must be, if the nebular hypothesis has any truth, older than o~
```

---

## **Misclassifications**

Let's talk about misclassifications. Which documents here were incorrectly predicted to *not* be written by Jane Austen?


```r
comment_classes %&gt;%
* filter(probability &lt; .3, title == "Pride and Prejudice" ) %&gt;%
  sample_n(5) %&gt;%
  inner_join(books %&gt;%
               select(document, text)) %&gt;%
  select(probability, text)
```

```
## # A tibble: 5 x 2
##   probability text                                                              
##         &lt;dbl&gt; &lt;chr&gt;                                                             
## 1      0.296  "      “May I ask to what these questions tend?”"                 
## 2      0.0315 "      the road, the house standing in it, the green pales, and t~
## 3      0.105  "      the design of selecting a wife, as I certainly did.”"      
## 4      0.0311 "      They descended the hill, crossed the bridge, and drove to ~
## 5      0.147  "      walking together in the shrubbery behind the house, they s~
```

---

background-image: url(figs/tmwr_0601.png)
background-position: 50% 70%
background-size: 750px

## **Workflow for text mining/modeling**

---

background-image: url(figs/lizzieskipping.gif)
background-position: 50% 55%
background-size: 750px

# **Go explore real-world text!**

---

class: left, middle

&lt;img src="figs/blue_jane.png" width="150px"/&gt;

# Thanks!

&lt;a href="http://twitter.com/juliasilge"&gt;&lt;i class="fa fa-twitter fa-fw"&gt;&lt;/i&gt;&amp;nbsp; @juliasilge&lt;/a&gt;&lt;br&gt;
&lt;a href="http://github.com/juliasilge"&gt;&lt;i class="fa fa-github fa-fw"&gt;&lt;/i&gt;&amp;nbsp; @juliasilge&lt;/a&gt;&lt;br&gt;
&lt;a href="https://juliasilge.com"&gt;&lt;i class="fa fa-link fa-fw"&gt;&lt;/i&gt;&amp;nbsp; juliasilge.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://tidytextmining.com"&gt;&lt;i class="fa fa-book fa-fw"&gt;&lt;/i&gt;&amp;nbsp; tidytextmining.com&lt;/a&gt;&lt;br&gt;
&lt;a href="mailto:julia.silge@gmail.com"&gt;&lt;i class="fa fa-paper-plane fa-fw"&gt;&lt;/i&gt;&amp;nbsp; julia.silge@gmail.com&lt;/a&gt;

Slides created with [**remark.js**](http://remarkjs.com/) and the R package [**xaringan**](https://github.com/yihui/xaringan)

---

class: middle, center

# &lt;i class="fa fa-check-circle"&gt;&lt;/i&gt;

# Submit feedback before you leave

.large[[rstd.io/ws-survey](https://rstd.io/ws-survey)]

    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false,
"ratio": "16:9"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
(function() {
  "use strict"
  // Replace <script> tags in slides area to make them executable
  var scripts = document.querySelectorAll(
    '.remark-slides-area .remark-slide-container script'
  );
  if (!scripts.length) return;
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    var code = document.createTextNode(scripts[i].textContent);
    s.appendChild(code);
    var scriptAttrs = scripts[i].attributes;
    for (var j = 0; j < scriptAttrs.length; j++) {
      s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
    }
    scripts[i].parentElement.replaceChild(s, scripts[i]);
  }
})();
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
  const hlines = d.querySelectorAll('.remark-code-line-highlighted');
  const preParents = [];
  const findPreParent = function(line, p = 0) {
    if (p > 1) return null; // traverse up no further than grandparent
    const el = line.parentElement;
    return el.tagName === "PRE" ? el : findPreParent(el, ++p);
  };

  for (let line of hlines) {
    let pre = findPreParent(line);
    if (pre && !preParents.includes(pre)) preParents.push(pre);
  }
  preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>
